This article reviews key text-mining capabilities of the R language and proposes methodologies for improving the text mining of large volumes of content, especially email, along with novel analytics and visualizations that enhance understanding of the content of a corpus. It builds on existing R-language text-mining APIs to suggest more meaningful and readable analytical capabilities and visualizations. Assuming a general familiarity with the R language, the article starts by looking at baseline capabilities of text-mining APIs, then gives examples of enhanced data-cleaning, normalization, data-augmentation and visualization methodologies. The aim is to develop a more deliberate process for deriving more relevant information from unstructured data, and email in particular. This article is intended as a preliminary study of several promising concepts; further research is needed to prove out the quantitative efficacy of these methods in commercial applications.
Large-scale, unstructured data management solutions often have sophisticated requirements for search, analytics, categorization, and data visualization. These applications include document management, ediscovery, knowledge management, and, most recently, data-privacy management solutions (for GDPR). Customers are initially presented with a promising demo, but find that in production use, the results can be lackluster at best. As a result, some high-potential analytic and data visualization features go largely unutilized in daily use. How can this gap between the demo and customer experience be closed?
Product demos use data that has been manicured to perfection or that naturally contains the necessary semantic structure. In benign examples, demo data is clean by design, as is the case with template libraries or Wikipedia articles. Real-world data, however, is unpredictable, noisy, and lacks the semantic elegance of Wikipedia. Nowhere is this more evident than in email data, which represents the majority of the data in many unstructured data management implementations.
Email presents some of the greatest challenges to text-mining applications:
This article presents some practical methods and novel visualizations that help to deal with the noisiest of unstructured data. In particular, the data at issue is email that has been scanned as images and then translated back to text through optical character recognition (OCR). Moreover, as in an ediscovery case or knowledge-management solution, the corpus has a known theme that is understood from the outset, and its deficiencies are also well understood (redaction; classified or privileged data missing). The hope is that these methods can take on the challenge of a noisy dataset and more quickly illuminate content that requires further investigation.
During her term as Secretary of State, Hillary Clinton and her inner circle largely communicated on an email server located at clintonemail.com. Though not the first to use a private email server, Clinton’s particular deployment became the subject of scrutiny, and ultimately an investigation, as it was revealed that material marked “Classified” had been either shared or discussed on a server managed and housed separately from normal Department of State secure network operations. As a result of the investigation, the emails were reviewed by Clinton’s lawyers and subsequently released to the public.
Notably, the data, though originating as email, was not released as machine-readable text (EML files). Part of the reason for this was the need for meticulous review and redaction of personal and classified material prior to public release. Instead, the data was released as Acrobat (PDF) files derived from image scanning; the only available text came from OCR processing of the scans. Therefore, neither structured metadata nor any fully reliable means of extracting the original text was available to the public at the time of release.
The advantage of using this data is principally the challenge it poses. Since the data has nearly zero internal structure, it helps to demonstrate the process, and the value, of programmatically adding structure and organization to otherwise chaotic data.
The dataset used here has already been extracted as OCR text and collated into separately released directories with document IDs as file names. Notably, the data is the raw OCR output, with many errors, which makes it a useful example of how data cleaning and effective processing can improve the overall results.
The data contains 27,159 text files, each a single email or released document. The text includes various markings, including security classifications and Bates stamping with a case number and a document number for review identification. Some pages have been redacted in full or in part for security reasons. These markings are both helpful and bothersome, as will become evident later, but that’s a reality of real-world data.
The requisite libraries to run this project are as follows:
library(tm)
library(magrittr)
library(ggplot2)
library(ggfortify)
library(ggalt)
library(ggrepel)
library(scales)
library(plyr)
library(dplyr)
library(tidytext)
library(widyr)
library(factoextra)
library(circlize)
library(slam)
library(wordcloud)
library(textclean)
library(mgsub)
library(logging)
library(tidyr)
library(igraph)
library(ggraph)
library(ggthemes)
library(stringr)
library(anytime)
library(dygraphs)
library(lubridate)
library(knitr)
library(kableExtra)
The data is first read into a VCorpus object, from which the “tm” package can access the document content:
#sourcePath = file.path(".", "SampleData", "HRC-DataHoarder-Github", "HRCEmail_JuneWeb")
sourcePath = file.path(".", "SampleData", "HRC-DataHoarder-Github")
list.dirs(sourcePath)
## [1] "./SampleData/HRC-DataHoarder-Github"
## [2] "./SampleData/HRC-DataHoarder-Github/Clinton_Email_August_Release"
## [3] "./SampleData/HRC-DataHoarder-Github/HRC_Email_296"
## [4] "./SampleData/HRC-DataHoarder-Github/HRC_NDAS"
## [5] "./SampleData/HRC-DataHoarder-Github/HRCEmail_DecWeb"
## [6] "./SampleData/HRC-DataHoarder-Github/HRCEmail_Feb13thWeb"
## [7] "./SampleData/HRC-DataHoarder-Github/HRCEmail_Jan29thWeb"
## [8] "./SampleData/HRC-DataHoarder-Github/HRCEmail_Jan7thWeb"
## [9] "./SampleData/HRC-DataHoarder-Github/HRCEmail_JulyWeb"
## [10] "./SampleData/HRC-DataHoarder-Github/HRCEmail_JuneWeb"
## [11] "./SampleData/HRC-DataHoarder-Github/HRCEmail_NovWeb"
## [12] "./SampleData/HRC-DataHoarder-Github/HRCEmail_OctWeb"
## [13] "./SampleData/HRC-DataHoarder-Github/HRCEmail_SeptemberWeb"
corpus = VCorpus(DirSource(sourcePath, recursive=TRUE))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 27159
There is a large volume of data, so some of the following operations will run for a long time, especially with insufficient memory or CPU. Experimenting with a single directory may be advisable when making modifications in some areas.
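One way to scale down while experimenting (a sketch, assuming the directory layout from the listing above) is to build the corpus from a single release directory:

```r
# Build a smaller corpus from one release directory while tuning the pipeline;
# "HRCEmail_JuneWeb" is one of the directories shown in the listing above.
smallPath <- file.path(sourcePath, "HRCEmail_JuneWeb")
corpus.small <- VCorpus(DirSource(smallPath, recursive = TRUE))
```

The rest of the pipeline can then be exercised against corpus.small before committing to the full 27,159 documents.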
Once data is loaded into the corpus object, any of the documents can be inspected as follows:
doc = corpus[[1]]$content
paste(doc, collapse=" ")
## [1] " UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765907 Date: 08/31/2015 RELEASE IN PART B5 From: Sullivan, Jacob J <SullivanJJ@state.gov> Sent: Friday, December 4, 2009 3:31 AM To: Subject: Iran The EU meets in the coming days, and we are hoping for a strong public - and private - position on Iran. Bill has identified 5 countries that need touching to help drive a good outcome: I know Huma has discussed with you, but a 2-minute discussion with each that underscores the key points reflected on your card would do the trick, if you can swing it. Tx. Also, the intervention, with your modifications, turned out well. The process, in this case, did not generate a good • enough product -- I tried to make it clearer and stronger this morning and your amendments helped a lot. UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765907 Date: 08/31/2015 \f"
The first step in any text-mining effort is to process the data. This involves removing noise, such as stop words and numbers, and normalizing the data by converting it to lower case and “stemming” words to their root forms (“discussing” and “discussion” both become the root “discuss”). Initially, the standard tools are used; later in the article, more drastic measures are taken to increase the signal-to-noise ratio and to enable more interesting visualizations.
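As a minimal illustration of the stemming step (a sketch using tm’s stemDocument, which wraps the Porter stemmer):

```r
library(tm)

# The Porter stemmer collapses inflected forms to a shared root
stemDocument(c("discussing", "discussion", "discussed"))
# all three reduce to "discuss"
```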
The next set of examples works with three versions of the same data, at various levels of processing:
# The order of operations can become important here; each step has the potential to affect the next.
cleanset <- corpus
cleanset <- tm_map(cleanset, content_transformer(tolower))
cleanset <- tm_map(cleanset, removeNumbers, ucp = FALSE)
cleanset <- tm_map(cleanset, stripWhitespace)
#let's load a much longer list of stopwords
english.stopwords.large = as.character(read.csv(file="long-stopwords.txt", header=FALSE)[,1])
cleanset <- tm_map(cleanset, removeWords, english.stopwords.large)
cleanset <- tm_map(cleanset, removePunctuation)
cleanset.preStem <- cleanset
cleanset <- tm_map(cleanset.preStem, stemDocument, language="en")
For the purposes of this article, some of the more basic operations have been encapsulated in a class called “CorpusSummary” that takes as input the cleanset corpus object just created. The source for this class can be downloaded from the related GitHub Project. Once downloaded, include it with the following:
source("corpusSummary.R")
This is not a formal, available package, but is included as source code with the article to illustrate the overall process. The class performs many key functions on a corpus, returning any of the following information:
All of this work is accomplished by simply creating a CorpusSummary object, as follows:
cs = CorpusSummary$new(cleanset, cleanset.preStem,
k.clusters = 5, k.rounds = 5, sparse.maximal = 0.9,
min.words.per.doc = 3)
## 2018-07-29 00:23:29 INFO::PROCESSING CORPUS WTIH DOCUMENTS: 27159
## 2018-07-29 00:23:29 INFO::...creating Matrix...
## 2018-07-29 00:24:00 INFO::......found 84,260 terms...
## 2018-07-29 00:24:00 INFO::...getting stem completions...
## 2018-07-29 00:24:42 INFO::...weighting the matrix...
## 2018-07-29 00:24:43 INFO::...removing sparse terms at maximal of 0.900000...
## 2018-07-29 00:24:45 INFO::......reduced to 109 terms...
## 2018-07-29 00:24:45 INFO::...creating euclidian distance matrix...
## 2018-07-29 00:24:48 INFO::...finding 5 kmeans clusters over 5 rounds...
## 2018-07-29 00:24:48 INFO::...composing summary...
## 2018-07-29 00:24:49 INFO::--- *** Cluster 1: 20,596 Documents, 26 Terms
## 2018-07-29 00:24:50 INFO::--- *** Cluster 2: 12,519 Documents, 29 Terms
## 2018-07-29 00:24:50 INFO::--- *** Cluster 3: 27,153 Documents, 7 Terms
## 2018-07-29 00:24:51 INFO::--- *** Cluster 4: 9,437 Documents, 31 Terms
## 2018-07-29 00:24:53 INFO::--- *** Cluster 5: 13,085 Documents, 16 Terms
## 2018-07-29 00:24:53 INFO::...DONE processing.
Some results to consider:
The constructor arguments are as follows:
Here are the getters to retrieve the individual results from a CorpusSummary object “cs”.
Get the original corpus that was processed:
c = cs$getCorpus()
Get a distance matrix of type “dist”:
d <- cs$getDist()
There are two ways to get the Document Term Matrix. The default returns only the most relevant terms; setting the “sparse” argument to FALSE returns the full set of terms (minus stopwords).
dtm <- cs$getDTM() #sparse = TRUE
dtm
## <<DocumentTermMatrix (documents: 27159, terms: 109)>>
## Non-/sparse entries: 651229/2309102
## Sparsity : 78%
## Maximal term length: 21
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
#and now the full index, which you'll see is much larger
dtm.full <- cs$getDTM(sparse=FALSE) # get the full DocumentTermMatrix
dtm.full
## <<DocumentTermMatrix (documents: 27159, terms: 84260)>>
## Non-/sparse entries: 2717676/2285699664
## Sparsity : 100%
## Maximal term length: 136
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Get most frequent terms from full or sparse processing:
wf = cs$getMostFrequentTerms(sparse = TRUE)
head(wf[!wf$word %in% rownames(wf),], 10) # true is default
## word freq
## mill mills 338.6136
## messag message 316.4474
## origin original 292.0921
## secretari secretary 224.9036
## januari january 220.0542
## septemb september 216.5710
## juli july 207.4680
## offic office 190.3231
## meet meeting 187.5624
## octob october 186.6647
Each row name shows the stem of the word, while the actual word displayed shows a possible stem completion. This completion is not perfect, by the way, but for display purposes, it’s better than the stem, which can be terse.
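The completion behavior can be sketched with tm’s stemCompletion; the toy dictionary below stands in for the pre-stem corpus that CorpusSummary actually uses:

```r
library(tm)

# type = "prevalent" picks the most frequent dictionary word
# that begins with the given stem
dict <- c("message", "messages", "message", "original", "original", "originally")
stemCompletion(c("messag", "origin"), dictionary = dict, type = "prevalent")
# messag -> "message", origin -> "original"
```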
Get the most frequent stem completion for each stem:
sct <- cs$getStemCompletionTable()
diffs <- which(as.character(sct$stem) != as.character(sct$completion) & (sct$stem %in% rownames(sct) ))
head(sct[diffs,], 10)
## stem completion
## depart depart department
## unclassifi unclassifi unclassified
## releas releas release
## messag messag message
## origin origin original
## secretari secretari secretary
## mill mill mills
## presid presid president
## govern govern government
## offic offic office
This data is saved and reused often to create more human-readable output in the case of plots, etc.
Given any single document number, retrieve the terms from that document:
head(cs$getTermsFromDoc(doc.number = 1), 10)
## word freq
## iran iran 0.14603876
## clearer clearer 0.14237293
## trick trick 0.13903632
## swing swing 0.12670726
## card card 0.11368405
## stronger stronger 0.11274416
## intervention interventionism 0.09594469
## help help 0.08952631
## discuss discuss 0.08924969
## good good 0.08873535
A k-means analysis has been run on the corpus; its results are used to compose an interesting map of the corpus data in the next step.
kr <- cs$getKmeansResults()
The getSummary method returns a list with one element per k-means cluster, each containing the terms selected for that cluster and the list of documents meeting the minimum word-occurrence threshold.
s <- cs$getSummary()
head(s[[5]]$docList)
## [1] 16905 5856 16904 17820 23107 23110
head(s[[5]]$termList,10)
## word freq
## messag message 316.4474
## origin original 292.0921
## hrodclintonemail hrodclintonemail 245.9795
## full full 232.1467
## secretari secretary 224.9036
## millscdstategov millscdstategov 195.8275
## meet meeting 187.5624
## hdrclintonemail hdrclintonemail 161.7180
## today today 154.3978
## email email 140.8304
Ultimately, presenting data to users involves data visualization. Many data visualization techniques, as shown below, suffer from one of these issues:
The visualizations below quickly demonstrate the problems encountered with raw data visualization, in particular the relative inability of these visualizations to deliver insight. The remainder of the article focuses largely on incrementally overcoming these deficiencies.
theme_safari = function() {
wb = element_rect(fill="white")
return(theme_fivethirtyeight() +
theme( legend.background = wb,
legend.key = wb,
plot.background = wb,
panel.background = wb,
panel.grid.major = element_blank(),
axis.title.x = element_blank(),
axis.text = element_text(colour="white"),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust=0.5)
))
}
Every visualization suite must have at least one bar graph (or pie chart, which generally should be avoided):
pal <- brewer.pal(6,"Dark2")[-(1)]
ggplot(head(wf, 35), aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat = "identity", fill=pal[2], show.legend=F) +
geom_text(aes(y=0, label=word), hjust=0, nudge_y=10, fontface="bold", color="white") +
coord_flip() +
ggtitle("Relative Word Frequencies for HRC Dataset") +
labs(y = "Occurrences", x = "Words by Rank") +
theme_safari() +
theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
plot.title = element_text(size="15", face="bold", hjust=.5))
Some form of cluster map is common, and it’s tempting to see what one will produce. In this unprocessed visualization, the data and the layout are just so much noise. Even with typical stop words removed, it’s just not revealing actionable content. Most notably, the terms are split such that Cheryl Mills’s name is not even in the same cluster as her email address, which, while mathematically correct, isn’t very useful information in the context of evaluating the corpus.
fviz_cluster(cs$getKmeansResults(), cs$getDist(completions = TRUE),
main="Clustering: Single-word Co-occurrence by Document",
repel=TRUE, labelsize=10, pointsize=2,
ellipse.alpha=.15,
xlab = FALSE, ylab = FALSE,
ggtheme=theme_safari()
)
Most data analysis efforts of any consequence involve normalization, as there are many confounding variables that will skew a purely generic approach. In this example, the data is OCR output from email, and it comes with several associated problems that make generic approaches, tuned on basic examples (e.g., novels), far less effective. Modern email is almost a language unto itself, and when terms are further skewed by OCR errors, more work has to be done.
While the data has been run through the out-of-the-box cleaning, more work must be done to normalize it further. To start, create a matrix to hold some upcoming regular expressions and their replacement values. Getting rid of everyday email clutter will help the k-means analysis and further visualization efforts process the meaningful signals in the data.
subs <- matrix(nrow=0, ncol=2)
Email like this tends to include a lot of date/time and planning information, which has skewed our results. Unless this is key to understanding the content (it usually isn’t), this type of content should be removed or, as in this case, “normalized” into a meaningful token (“DAY” or “MONTH”). The language here is assumed to be English; a commercial application would, of course, need more sophisticated language detection.
subs <- subs %>%
rbind(c("Saturday|Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Tommorrow|Yesterday", "DAY")) %>%
rbind(c("January|February|March|April|May|June|July|August|September|October|November|December", "MONTH")) %>%
rbind(c("CALENDAR ENTRY:", "CALENDAR"))
This dataset has some additional noise that is particularly bothersome in further analysis:
All of these are removed.
subs <- subs %>%
rbind(c("(UNCLASSIFIED U.?S.? Department of State Case No.? F[0-9\\-]{10,12} Doc No?)|[—•]|Original Message|((Sent: )*(\\w+day), (Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?) \\d+, \\d+ \\d+:\\d+ (AM|PM))", " "))
In this corpus, it would be helpful to better understand the association of key players in the text. Like email in any organization, corpus-specific abbreviations are troublesome. In particular, Hillary Clinton has many abbreviations or alternatives:
The expression below attempts to replace all of these with her email address as a unique identifier. This will be changed again later on.
subs <- subs %>%
rbind(c("(\\sHRC\\s)|(^(To:|From:|CC:)\\s*H2?$)|(Mrs.?\\s*Clinton)|(\\sH\\scalled)", "hrod17@clintonemail.com"))
These replacements can be difficult to generalize. It is tricky to avoid false positives, which, depending on the end process, could be benign or could lead to ugly, unintended consequences.
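A toy check, reusing part of the pattern above, shows how easily the bare “H” forms can over-match:

```r
# The second string is a false positive: " H called" appears in an
# unrelated context yet still matches the pattern
patt <- "(^(To:|From:|CC:)\\s*H2?$)|(\\sH\\scalled)"
grepl(patt, c("From: H", "Plan H called for revisions"), perl = TRUE)
# both elements are TRUE
```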
The OCR process introduces errors of all kinds, some of which bias the data greatly. As an example, in this corpus the copyright symbol (©) was detected where an at sign (@) was the correct choice; a completely understandable and easily remedied error. Another fun one involves Lona Valmoro, whose email address ValmoroLJ@state.gov confuses the OCR into rendering the letters “LJ” as a variety of combinations, including “U”, “l3”, and even “.]”, given the lack of dictionary fix-up capabilities. Cleaning up this data is essential if Lona Valmoro is a subject in our review of documents. There are techniques for fixing these after the fact, the most common being to allow some wiggle room in the number of characters, and for spelling errors in important places. In the example below, additional leeway is also given in case “.com” is followed by a stray character, a common occurrence.
These fix-ups are done only against some of the more important people and email addresses, though it could be generalized more broadly, at a processing cost.
subs <- subs %>%
rbind(c("©state.gov[eil1]?", "@state.gov")) %>%
rbind(c("©clintonemail.com[eil1]?", "@clintonemail.com")) %>%
rbind(c("abedin[[:alpha:]\\(]{1,3}s[tl]ate.gov", "abedinh@state.gov")) %>%
rbind(c("hdr22[[:alpha:]\\(]{1,3}clintonemail.com", "hdr22@clintonemail.com")) %>%
rbind(c("millscd[[:alpha:]\\(]{1,3}state.gov", "millscd@state.gov")) %>%
rbind(c("sulliva[3jnrial]{3}[©@p]state.gov|Jake.Sulli[vy]an", "sullivanjj@state.gov")) %>%
rbind(c("[1ijf]?vaolmoro[lju1i©@.\\3\\]]{1,4}state.gov[iel1]?", "valmorolj@state.gov"))
One way to normalize is to turn the principal players into unique terms, rather than leaving name- and email-related terms spread all over the bag of words. This greatly consolidates the clustering job and begins to illuminate information that wasn’t present before. It is accomplished through some relatively intense regular-expression work. The first step is to load a directory of key players, with each person’s name and up to two email addresses. Each is then associated with a unique single-term moniker: Hillary Clinton, for example, becomes “HILLARYCLINTON” and Huma Abedin becomes “HUMAABEDIN”.1
directory <- read.csv(file="directory-data.csv")
| Moniker | First | Last | Email1 |
|---|---|---|---|
| CHERYLMILLS | Cheryl | Mills | millscd@state.gov |
| HILLARYCLINTON | Hillary | Clinton | hrod17@clintonemail.com |
| JAKESULLIVAN | Jacob | Sullivan | sullivanjj@state.gov |
| HUMAABEDIN | Huma | Abedin | abedinh@state.gov |
| LAURENJILOTY | Lauren | Jiloty | JilotyLC@state.gov |
| LONAVALMORO | Lona | Valmoro | ValmoroLJ@state.gov |
| MONICAHANLEY | Monica | Hanley | HanleyMR@state.gov |
Using the directory, a similar regular-expression-based search and replace is formed to turn all of these instances into single-term tokens. Here a human-readable moniker is used for clarity, but a unique hash serving as a lookup into the directory would work as well.
d <- as.matrix(directory)
replace.monikers <- matrix(nrow=0, ncol=2, dimnames=list(NULL,c("pattern","replacement")))
for (r in 1:nrow(d)) {
e2 <- d[r,"Email2"]
if (nchar(e2)>0) {
e2 <- paste("|", e2, sep="")
} else {
e2 <- ""
}
if (nchar(d[r, "First"]) > 0 && nchar(d[r, "Last"]) > 0) {
patt <- sprintf("(%s\\s?(%s\\s?)?%s)|(((%s,\\s*%s\\s*(%s)?\\s*)?((\\<\\s*|\\[mailto:)?(%s%s)))(\\s*[\\]\\>])?)|(%s,\\s*%s(\\s?%s)?)",
d[r, "First"], d[r, "Middle"], d[r, "Last"],
d[r, "Last"], d[r, "First"], d[r, "Middle"], d[r, "Email1"], e2,
d[r, "Last"], d[r, "First"], d[r, "Middle"])
if (nchar(d[r,"Nickname"])>0) {
patt <- paste(patt, sprintf("|(%s[\\s.]*%s)", d[r, "Nickname"], d[r, "Last"]), sep="")
}
} else {
patt <- sprintf("(\\<\\s*|\\[mailto:)*(%s%s)(\\s*[\\]\\>])*",
d[r, "Email1"],
e2)
}
#add a space on both sides of the moniker to ensure it gets termed correctly
replace.monikers <- replace.monikers %>% rbind(c(patt, d[r,"Moniker"]))
}
Most of the email addresses are now gone and replaced by the monikers:
doc1 <- corpus[[12]]$content
rm <- replace.monikers
rm[,2] <- paste("<mark>", rm[,2], "</mark>", sep="")
doc1p <- mgsub(doc1, subs[,1], subs[,2], ignore.case=T, perl=T, fixed=F) %>%
mgsub(rm[,1], rm[,2], ignore.case=T, perl=T, fixed=F)
cat("<pre><code class='h1js'>", paste(doc1p, collapse=" "), "</code></pre>")
. C05765938 Date: 08/31/2015 RELEASE IN PART B5 From: JAKESULLIVAN Sent: DAY, MONTH 6,2009 11:00 AM HILLARYCLINTON Subject: Re: Jake- We could reach out in the next couple of hours or alternatively early tomorrow. The Eikenberry points sent last night, plus Holbrooke's add'n, provide a good roadmap. Let us know how you'd like to proceed. From: H HILLARYCLINTON To: JAKESULLIVAN i Sent: Sun Dec 06 10:33:09 2009 Subject: Re: Jake-- When can I make the Karzai call? From: JAKESULLIVAN HILLARYCLINTON Cc: HUMAABEDIN Sent: Sun Dec 06 10:30:03 2009 Subject: Fw: Jake-- FYI From: Holbrooke, Richard To: JAKESULLIVANi Sent: Sun Dec 06 08:06:33 2009 Subject: Jake-- H called me DAY when I was out of pocket. Available now if she still wants to talk. Thanks, R . C05765938 Date: 08/31/2015
Here is the raw content from the corpus:
## [1] " UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765938 Date: 08/31/2015 RELEASE IN PART B5 From: Sullivan, Jacob J <SullivanJJ@state.gov> Sent: Sunday, December 6,2009 11:00 AM To: H Subject: Re: Jake- We could reach out in the next couple of hours or alternatively early tomorrow. The Eikenberry points sent last night, plus Holbrooke's add'n, provide a good roadmap. Let us know how you'd like to proceed. Original Message From: H <HDR22@clintonemail.com> To: Sullivan, Jacob i Sent: Sun Dec 06 10:33:09 2009 Subject: Re: Jake-- When can I make the Karzai call? Original Message From: Sullivan, Jacob J <Sullivanii@state.gov> To: H Cc: Abedin, Huma <AbedinH@state.gov> Sent: Sun Dec 06 10:30:03 2009 Subject: Fw: Jake-- FYI Original Message From: Holbrooke, Richard To: Sullivan, Jacobi Sent: Sun Dec 06 08:06:33 2009 Subject: Jake-- H called me yesterday when I was out of pocket. Available now if she still wants to talk. Thanks, R UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765938 Date: 08/31/2015 \f"
The previously prepared regular expressions are now part of the cleaning regimen, from which more interesting visualizations can be produced. First, the CorpusSummary is recomputed after applying the substitutions discussed above.
cleanset <- corpus
cleanset <- tm_map(cleanset, content_transformer(textclean::mgsub), pattern=subs[,1], replacement=subs[,2], ignore.case=T, perl=T, fixed=F)
cleanset <- tm_map(cleanset, stripWhitespace)
cleanset <- tm_map(cleanset, content_transformer(textclean::mgsub),
pattern=replace.monikers[,1], replacement=replace.monikers[,2],
ignore.case=T, perl=T, fixed=F, leadspace=TRUE, trailspace=TRUE)
cleanset <- tm_map(cleanset, content_transformer(tolower))
cleanset <- tm_map(cleanset, removeNumbers, ucp = FALSE)
cleanset <- tm_map(cleanset, removeWords, words=english.stopwords.large)
cleanset <- tm_map(cleanset, removePunctuation)
cleanset.preStem <- cleanset
cleanset <- tm_map(cleanset.preStem, stemDocument, language="en")
cs2 <- CorpusSummary$new(cleanset, cleanset.preStem,
k.clusters = 4, k.rounds = 5,
sparse.maximal = 0.92,
min.words.per.doc = 3)
## 2018-07-29 00:36:10 INFO::PROCESSING CORPUS WTIH DOCUMENTS: 27159
## 2018-07-29 00:36:10 INFO::...creating Matrix...
## 2018-07-29 00:36:55 INFO::......found 81,575 terms...
## 2018-07-29 00:36:55 INFO::...getting stem completions...
## 2018-07-29 00:37:54 INFO::...weighting the matrix...
## 2018-07-29 00:37:54 INFO::...removing sparse terms at maximal of 0.920000...
## 2018-07-29 00:37:58 INFO::......reduced to 127 terms...
## 2018-07-29 00:37:58 INFO::...creating euclidian distance matrix...
## 2018-07-29 00:38:04 INFO::...finding 4 kmeans clusters over 5 rounds...
## 2018-07-29 00:38:04 INFO::...composing summary...
## 2018-07-29 00:38:05 INFO::--- *** Cluster 1: 1,603 Documents, 8 Terms
## 2018-07-29 00:38:06 INFO::--- *** Cluster 2: 11,767 Documents, 22 Terms
## 2018-07-29 00:38:06 INFO::--- *** Cluster 3: 26,635 Documents, 47 Terms
## 2018-07-29 00:38:08 INFO::--- *** Cluster 4: 15,204 Documents, 50 Terms
## 2018-07-29 00:38:08 INFO::...DONE processing.
Now that key people are tagged, their content can be extracted directly from the summary statistics:
wf <- cs2$getMostFrequentTerms(sparse = FALSE)
mlist <-
data.frame( moniker=directory$Moniker,
displayName=paste(directory$First, directory$Last, sep=" "),
freq=(wf[stemDocument(tolower(directory$Moniker), language="en"),"freq"]),
stem=tolower(directory$Moniker),
row.names="stem",
stringsAsFactors = F
)
ggplot(mlist, aes(x = reorder(displayName, freq), y = freq)) +
geom_bar(fill=pal[2], stat = "identity", show.legend=F) +
geom_text(aes(label=displayName, y=0), hjust=0, nudge_y=10,
fontface="bold", color="white") +
coord_flip() +
ggtitle("Key Persons Relative Occurrences as Terms") +
theme_safari() +
theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
plot.title = element_text(size="15", face="bold", hjust=.5))
In the next set of illustrations, it is helpful to understand that the most frequent terms are not always the most interesting. This is the thought behind TF-IDF (term frequency-inverse document frequency). A great many words that surface in visualizations are of the frequent-but-uninteresting type: in this corpus, at least one of the terms below appears an average of 14 times per document. Keeping them in for statistical reasons, but removing them from visualizations, can boost signal with a minimum of effort.
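As a minimal base-R sketch of the TF-IDF idea, on a toy three-document corpus (the terms are illustrative and not drawn from the dataset; the CorpusSummary class relies on tm’s own TF-IDF weighting):

```r
docs <- list(c("iran", "meeting", "iran"),
             c("meeting", "schedule"),
             c("iran", "policy"))
terms <- sort(unique(unlist(docs)))

# Term frequency: each term's share of its document
tf <- t(sapply(docs, function(d) table(factor(d, levels = terms)) / length(d)))

# Inverse document frequency: terms rare across documents score higher
idf <- log(length(docs) / colSums(tf > 0))

# "meeting" (in 2 of 3 docs) is down-weighted relative to "schedule" (1 of 3)
tfidf <- sweep(tf, 2, idf, `*`)
```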
email.form.terms <- c("subject", "cc", "to", "bcc", "sent", "mon", "tue", "wed", "thu", "fri", "sat", "sun", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec")
planning.words = c("today", "tomorrow", "anytime", "morning", "noon", "afternoon", "availability", "schedule", "time", "week", "day", "month", "year", "meetings", "meeting", "meet", "release", "pls", "good", "work", "will", "well", "going", "told", "email", "full", "fw", "fyi", "talk", "am", "pm", "shuttle", "udpate", "btw", "calls", "print", "fax", "list", "minutes", "asap", "drive", "plane", "flight", "connect", "todayday", "wrote", "residence", "room", "thx", "forward", "set", "finish", "checking", "depart", "departs", "departing", "arrive", "check", "route", "copy", "cscc", "copying", "messages", "attached", "mailto", "emailed", "confirmed", "mtg", "tonight", "tonite", "talked", "scheduled", "copies", "trip", "talk", "talking", "offered", "sending", "statement", "sbu")
corpus.stopwords = sort(unique(c(email.form.terms, planning.words)))
print(corpus.stopwords, quote=FALSE)
## [1] afternoon am anytime apr arrive
## [6] asap attached aug availability bcc
## [11] btw calls cc check checking
## [16] confirmed connect copies copy copying
## [21] cscc day dec depart departing
## [26] departs drive email emailed fax
## [31] feb finish flight forward fri
## [36] full fw fyi going good
## [41] jan jul jun list mailto
## [46] mar meet meeting meetings messages
## [51] minutes mon month morning mtg
## [56] noon nov oct offered plane
## [61] pls pm print release residence
## [66] room route sat sbu schedule
## [71] scheduled sending sent sep set
## [76] shuttle statement subject sun talk
## [81] talked talking thu thx time
## [86] to today todayday told tomorrow
## [91] tonight tonite trip tue udpate
## [96] wed week well will work
## [101] wrote year
Removing additional noisy stop words helps, as the comparison word clouds below demonstrate. Noise in the dataset is particularly amplified in more complex analytics and visualizations; by combining normalization with this additional focus, many visualizations become more engaging. The word cloud on the right surfaces important terms that were crowded out by the mundane terms (in gray) on the left.
getWordCloudData = function(wf.table, nterms=50, wf.replace = NULL, stopwords=NULL, markwords=NULL) {
w <- wf.table
if (!is.null(stopwords)) {
w <- head(w[-which(w$word %in% stopwords),], nterms)
} else {
w <- head(w, nterms)
}
w <- data.frame(
word = as.character(w$word),
freq = w$freq,
isMarked = ifelse(!is.null(markwords) & (as.character(w$word) %in% markwords), TRUE, FALSE),
isReplaced = FALSE,
row.names = as.character(w$word),
stringsAsFactors = FALSE
)
if (!is.null(wf.replace)) {
mx <- which(w$word %in% rownames(wf.replace))
w[mx, "word"] <- wf.replace[w[mx,"word"], "displayName"]
w[mx, "isReplaced"] <- TRUE
}
return(w)
}
pal <- brewer.pal(6,"Dark2")
pal <- pal[-(1)]
wfc <- getWordCloudData(wf, nterms=100, markwords = corpus.stopwords)
wfc2 <- getWordCloudData(wf, wf.replace=mlist, nterms=100, stopwords=corpus.stopwords)
wfc2$isMarked <- ifelse(wfc2$isReplaced | (wfc2$word %in% wfc$word), FALSE, TRUE)
# Side-by-side comparison: the left cloud shows the noisy stop words in
# gray; in the right cloud, newly surfaced terms appear in black
par(mfrow=c(1,2))
wordcloud(wfc$word, wfc$freq, scale=c(3.5,1),
random.order=FALSE, random.color=FALSE, rot.per = .25, ordered.colors=TRUE,
colors=c("DarkGray", pal)[as.integer(!wfc$isMarked) + 1 + wfc$isReplaced])
wordcloud(wfc2$word, wfc2$freq, scale=c(3.5,1), ordered.colors=TRUE,
random.order=FALSE, random.color=FALSE, rot.per = .25,
colors=c("Black", pal)[as.integer(!wfc2$isMarked) + 1 + wfc2$isReplaced])
The cluster diagram also begins to show real promise. Instead of learning the association of “Cheryl” to “Mills”, the association is now between principal players in a social network. Visualizations can now safely highlight key players, or otherwise reorganize the data to collate it effectively. Note that with this level of normalization and cleaning, 5 groupings may be more than necessary, but again, this is subjective; what matters is that the groupings show incrementally more signal as the data is further refined. This is just one incremental step of many that starts to improve the overall usefulness of the analytics and the related visualizations.
getPlottableKmeansData = function(kmeans, dist) {
#code is adapted and customized from factoextra package for better control
p.ind <- scale(dist) %>%
stats::prcomp(scale = FALSE, center = FALSE) %>%
facto_summarize(element = "ind", result = "coord")
colnames(p.ind)[2:3] <- c("x", "y")
p.data <- cbind.data.frame(p.ind,
cluster = as.factor(kmeans$cluster[as.character(p.ind$name)]),
stringsAsFactors=FALSE)
return(p.data)
}
p.data <- getPlottableKmeansData(cs2$getKmeansResults(), cs2$getDist()) %>%
filter(!as.character(name) %in% corpus.stopwords)
p.data$name <- cs2$getStemCompletionTable()[as.character(p.data$name),"completion"]
p.repl <- which(p.data$name %in% rownames(mlist))
p.clgroups <- by(p.data[-p.repl,], p.data[-p.repl,"cluster"], FUN = (
function(c) {
labels <- str_wrap(paste(head(c[order(-c$coord), "name"], 20), collapse=", "), width=1)
cg <- data.frame(
x = mean(c$x), y = mean(c$y),
cluster = as.factor(c$cluster[1]),
isMoniker = FALSE,
label = labels,
labelSize = 5,
force=25,
fontface="plain",
stringsAsFactors = FALSE
)
return(cg)
}
)) %>%
rbind.fill() %>%
rbind(data.frame(
x = p.data$x[p.repl],
y = p.data$y[p.repl],
cluster = p.data$cluster[p.repl],
isMoniker = TRUE,
labelSize = 6,
fontface="bold",
force=200,
label = mlist[as.character(p.data$name[p.repl]), "displayName"],
stringsAsFactors = FALSE
))
ggplot(p.data, aes(x=x, y=y)) +
geom_encircle(data=p.data, show.legend=FALSE, inherit.aes = T, aes(fill=cluster), alpha=.25, expand=.05) +
geom_text_repel(data=p.clgroups[p.clgroups$isMoniker,],
aes(x=x, y=y,
label=str_wrap(label, width=30),
color=cluster, size=labelSize, fontface=fontface),
direction="y", force=10, show.legend = FALSE) +
geom_point(data=p.data[p.repl,],
aes(x=x, y=y, color=cluster), size=4, alpha=.5) +
geom_text_repel(data=p.clgroups[!p.clgroups$isMoniker,],
aes(x=x, y=y,
label=str_wrap(label, width=30),
color=cluster, size=labelSize, fontface=fontface),
# ylim=c(mean(p.clgroups[!p.clgroups$isMoniker,"y"]), max(p.clgroups$y)),
direction="both", force=55, show.legend = FALSE,
lineheight=.8, vjust=.2, segment.alpha = .5) +
geom_point(data=p.data[-p.repl,],
aes(x=x, y=y, color=cluster), size=1) +
geom_point(data=p.clgroups[!p.clgroups$isMoniker,],
aes(x=x, y=y, color=cluster), size=4) +
scale_x_continuous(expand = c(.2, .2)) +
scale_y_continuous(expand = c(.2, .2)) +
scale_size_continuous(range = c(5,8)) +
scale_color_discrete(name="Cluster") +
ggtitle("Clustered Terms by Association") +
theme_safari()
The cluster summary data from this round can be used in a variety of ways, including producing a navigable UI. The enhanced stop-word list remains in effect for this report.
s <- cs2$getSummary()
s.terms <- lapply(1:length(s), FUN= (function(i) {
mon <- tolower(mlist$moniker)
m <- which(s[[i]]$termList$word %in% mon)
t <- which(!s[[i]]$termList$word %in% mon)
p <- paste(gsub("\\s", " ", mlist[as.character(s[[i]]$termList$word[m]), "displayName"]), collapse=", ")
r <- data.frame(
Cluster=i,
Matching=length(s[[i]]$docList),
People=p,
Terms=paste(s[[i]]$termList$word[t], collapse=", "),
stringsAsFactors = F)
return(r)
})) %>%
rbind.fill()
| Cluster | Matching | People | Terms |
|---|---|---|---|
| 1 | 1,603 | Cheryl Mills, Jacob Sullivan, Huma Abedin | talk, tomorrow, schedule, calls, statement |
| 2 | 11,767 | Hillary Clinton | full, secretary, department, meeting, office, today, email, fyi, report, security, house, morning, cheryl, clinton, david, attachments, point, message, speak, question, thought |
| 3 | 26,635 | | government, well, people, support, national, country, public, washington, source, united, senior, american, policy, group, personal, international, including, subject, move, political, leaders, forces, position, change, military, close, breaking, appears, direct, power, interest, general, members, response, original, efforts, continue, clear, service, open, concerned, provide, long, remain, day, month, release |
| 4 | 15,204 | | will, state, work, time, president, discuss, news, good, best, week, minister, send, officials, issues, hope, sure, foreign, confidential, year, help, reason, great, plan, assistance, asked, note, staff, press, conference, told, forward, ambassador, things, hillary, reach, case, wanted, set, visit, development, agreed, election, private, start, affairs, received, head, find, post, number |
This gives only a vague sense of the topics. In the next section, N-grams are used to find more interesting terms.
There are two important takeaways here:
This level of data processing, while requiring significant up-front work, can be productized, improving the value of existing analytics techniques and visualizations.
Heading deeper into the data, the “tidytext” package is deployed to extract n-grams: multi-word sequences that occur frequently in a text and, as such, are more likely to represent meaningful relationships. The following visualization is adapted from the online site for the (quite excellent) book “Text Mining with R” and shows how to use tidytext to build a fairly robust visualization of n-grams.
This basic visualization simply shows the most frequent N-grams connected in a graph, clustering N-grams that share individual terms. In this case only bigrams are used, but longer sequences could also be identified, with diminishing returns above N=3.
hrcemails <- tidy(cleanset.preStem)
count_bigrams <- function(dataset) {
dataset %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
dplyr::count(word1, word2, sort = TRUE)
}
hrcemails.bigrams <- hrcemails %>%
count_bigrams() %>%
filter(!(word1 %in% corpus.stopwords), !(word2 %in% corpus.stopwords))
visualize_bigrams <- function(bigrams) {
set.seed(824)
a <- grid::arrow(type = "closed", length = unit(.08, "inches"))
bigrams %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
geom_node_point(color = "Gray", size = 3) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1, size=4) +
theme_void()
}
visualize_bigrams(head(hrcemails.bigrams, 100))
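The same tidytext pipeline extends naturally to trigrams. The sketch below is a hypothetical N=3 variant of `count_bigrams` (not run against the corpus here), to show the shape such an extension would take:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical trigram counterpart to count_bigrams(): counts three-word
# sequences, filtering stop words from every position.
count_trigrams <- function(dataset, stopwords) {
  dataset %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
    filter(!word1 %in% stopwords,
           !word2 %in% stopwords,
           !word3 %in% stopwords) %>%
    dplyr::count(word1, word2, word3, sort = TRUE)
}
```

As noted above, returns diminish past N=3: longer sequences are sparser, so their counts tend to be dominated by boilerplate such as signature blocks and forwarded headers.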
While this has some promising data, it’s difficult to follow because it has no depth or differentiation. Another problem is that names still tend to clutter the graph. The graph can be augmented with dimensions from outside sources to give each N-gram a context. This is not an exact science, but prioritizing the identification of context in each N-gram delivers sufficient context to make following the graph more reasonable. Using dictionary matching, a type is appended to each N-gram as a simple form of entity extraction. This dictionary matching creates 5 different types: Title, Org, Name, Country, and Phrase (with Other as a fallback).
A new column called “type” acts as a cue to enhance the visualization. The code below isn’t meant to be robust, but rather to demonstrate how entity extraction can be used in practice.
hrcNgrams <- filter(hrcemails.bigrams,
!str_detect(hrcemails.bigrams$word1, "^\\w$"),
!str_detect(hrcemails.bigrams$word2, "^\\w$"))
hrcNgrams$type <- NA
getBigramRowMatchesFromFile = function(x, file) {
e <- read.csv(file=file, header=T, sep=" ")
return(which(is.na(x$type) & x$word1 %in% e$word1 & x$word2 %in% e$word2))
}
hrcNgrams[getBigramRowMatchesFromFile(hrcNgrams, "titles.txt"), "type"] <- "Title"
hrcNgrams[getBigramRowMatchesFromFile(hrcNgrams, "orgs.txt"), "type"] <- "Org"
first.names <-
tolower(scan(file="First_Names.csv", what="character")) %>%
unique()
last.names <-
tolower(scan(file="Last_Names.csv", what="character")) %>%
unique()
hrcNgrams.names <-
which(is.na(hrcNgrams$type) &
((hrcNgrams$word1 %in% first.names & hrcNgrams$word2 %in% last.names) |
(hrcNgrams$word2 %in% first.names & hrcNgrams$word1 %in% last.names)))
hrcNgrams[hrcNgrams.names,"type"] <- "Name"
countries <- read.csv(file="countries_split.csv")
# a bigram is typed as a country only when both halves match the split name
hrcNgrams.countries <-
which(is.na(hrcNgrams$type) &
hrcNgrams$word1 %in% countries$first &
hrcNgrams$word2 %in% countries$second)
hrcNgrams[hrcNgrams.countries,]$type <- "Country"
en2grams <- read.csv(file="en2grams.csv")
hrcNgrams.twograms <-
which(is.na(hrcNgrams$type) &
(hrcNgrams$word1 %in% en2grams$word1 &
hrcNgrams$word2 %in% en2grams$word2))
hrcNgrams[hrcNgrams.twograms,]$type <- "Phrase"
hrcNgrams[is.na(hrcNgrams$type),]$type <- "Other"
| word1 | word2 | n | type |
|---|---|---|---|
| united | states | 7,151 | Country |
| state | department | 6,233 | Org |
| white | house | 4,699 | Org |
| department | state | 3,812 | Org |
| prime | minister | 3,392 | Title |
| secretary | state | 3,187 | Title |
| secretary | office | 3,014 | Org |
| human | rights | 2,544 | Phrase |
| secretary | clinton | 2,448 | Phrase |
| foreign | policy | 2,029 | Phrase |
| middle | east | 1,988 | Phrase |
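Before plotting, it is worth tabulating the new column to see how much of the data each dictionary pass captured (a quick diagnostic, assuming the `hrcNgrams` data frame built above):

```r
# Share of bigrams assigned to each type by the dictionary passes.
# A large "Other" share suggests the dictionaries need broadening.
round(prop.table(table(hrcNgrams$type)), 3)
```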
With this categorized data, the visualization has depth. The graph below demonstrates the value of the type in a low-dimensional graph, but combined with other variables, as will be seen later, the effect on overall understanding of the data is remarkable.
visualize_bigrams_2 <- function(bigrams, verts) {
set.seed(1587)
a <- grid::arrow(type = "closed", length = unit(.08, "inches"))
bigrams %>%
graph_from_data_frame(vertices=verts) %>%
ggraph(layout = "nicely") +
geom_edge_link(aes(color = type), show.legend=FALSE, arrow = a) +
geom_node_point(aes(color = type), size=3, show.legend=TRUE) +
geom_node_text(aes(label = name, color=type),
vjust = 0, hjust = 1, size=4, nudge_x = -.25,
show.legend = FALSE) +
scale_color_brewer(palette="Dark2", type="qual", guide="legend",
direction=-1) +
scale_edge_color_brewer(palette="Dark2", type="qual", guide="none",
direction=-1) +
scale_x_continuous(expand = expand_scale(c(0.175, 0.0))) +
labs(color="Type") +
theme_safari()
}
# This bit of nonsense is necessary to make the graph vertices work; why it
# can't infer the vertices from the edge as an option is a question for a
# different forum.
getVertsFromBigrams = function(bigrams) {
verts <-
data.frame(name=c(bigrams$word1, bigrams$word2),
type=c(bigrams$type, bigrams$type)) %>%
filter(!duplicated(name))
return(verts)
}
hn <- head(hrcNgrams, 125) %>%
filter(!word1 %in% tolower(directory$Moniker), !word2 %in% tolower(directory$Moniker))
verts <- getVertsFromBigrams(hn)
visualize_bigrams_2(hn, verts)
The previous graph showed only correlations of words. It’s interesting as an enhanced word cloud, but doesn’t go beyond a general understanding of the topics discussed in the entire corpus. By cross-referencing these n-grams against external contextual clues, it’s possible to glimpse person-to-topic correlations in a reasonably abstract way.
mwc.stopwords <- c(corpus.stopwords, first.names[-which(first.names %in% last.names)])
getWordsFromText = function(texts, stopwords, ids=NULL) {
t <- texts
if (!is.null(ids)) t <- t %>% filter(id %in% ids)
words <- t %>%
select(id, text) %>%
unnest_tokens(word, text) %>%
filter(nchar(word) > 2, !(word %in% stopwords))
return(words)
}
getWordCorrelations = function(w, limit = 100, positive.only = TRUE) {
corr <- w %>%
group_by(word) %>%
filter(n() > limit) %>%
pairwise_cor(word, id, sort=TRUE)
if (positive.only)
corr <- corr %>%
filter(correlation > 0.01)
return(corr)
}
mwc.words <- getWordsFromText(hrcemails, mwc.stopwords)
# Necessary because the correlation should be against stems
mwc.unique.words <- unique(mwc.words$word)
mwc.stem.lookup <- data.frame(word=mwc.unique.words,
stem=stemDocument(mwc.unique.words, language="en"),
row.names = mwc.unique.words)
mwc.wft = cs2$getMostFrequentTerms(sparse = FALSE)
mwc.words$word <- mwc.stem.lookup[mwc.words$word,"stem"]
mwc.corr <- getWordCorrelations(mwc.words)
mwc.corr$freq1 <- mwc.wft[as.character(mwc.corr$item1), "freq"]
mwc.corr$freq2 <- mwc.wft[as.character(mwc.corr$item2), "freq"]
mwc.corr$item1 <- mwc.wft[as.character(mwc.corr$item1), "word"]
mwc.corr$item2 <- mwc.wft[as.character(mwc.corr$item2), "word"]
In the following graph, terms that are highly correlated with a person found in the previously mentioned directory are plotted. Edge weight indicates the degree of correlation; this is purely optional, but does convey some additional understanding.
graph.term.by.person = function(corr.words, moniker.list, g.layout="fr") {
all.terms <- c(as.character(corr.words$item1), as.character(corr.words$item2)) %>% unique()
persons = all.terms[all.terms %in% tolower(as.character(moniker.list$moniker))]
persons2 = all.terms[all.terms %in% last.names & !all.terms %in% persons]
terms <- all.terms[!all.terms %in% persons & !all.terms %in% persons2]
gtbp.pal = brewer.pal(8, "Dark2")
gtbp.person.text = gtbp.pal[2]
gtbp.person.node = gtbp.pal[2]
gtbp.person2.text = gtbp.pal[3]
gtbp.person2.node = gtbp.pal[3]
gtbp.term.text = "grey40"
gtbp.term.node = gtbp.pal[1]
verts <- rbind(
data.frame(name=persons, type="KeyPerson", face="bold", size=5,
displayName=moniker.list[persons,"displayName"],
nodeColor=gtbp.person.node, labelColor=gtbp.person.text),
data.frame(name=persons2, type="Person", face="bold", size=4,
displayName=persons2,
nodeColor=gtbp.person2.node, labelColor=gtbp.person2.text),
data.frame(name=terms, type="Term", face="bold", size=4, displayName=terms,
nodeColor=gtbp.term.node, labelColor=gtbp.term.text)
)
set.seed(1234)
g = graph_from_data_frame(corr.words, vertices = verts) %>%
ggraph(layout = g.layout) +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "DarkGray", show.legend=F) +
geom_node_point(aes(color=nodeColor), size = 4, alpha=.8) +
geom_node_text(aes(label = displayName, fontface=face, color=labelColor, size=type),
repel = TRUE, point.padding = unit(0.3, "lines"),
show.legend=TRUE, check_overlap=TRUE) +
scale_edge_width(range=c(1,3), guide="none") +
scale_edge_alpha(guide="none", range=c(.25, .65)) +
scale_size_manual(guide="none", values=c(6,3,3)) +
scale_color_identity() +
labs(color="Type") +
theme_safari()
return(g)
}
mwc <- mwc.corr %>%
filter(correlation > 0.20, item1 %in% tolower(directory$Moniker), !item2 %in% tolower(directory$Moniker)) %>%
group_by(item1) %>%
top_n(18, correlation) %>%
ungroup()
graph.term.by.person(mwc, mlist, g.layout="fr")
With just a little more data in the graph, including more edges to represent high-correlation connections and 2nd-degree correlations (or people-to-N-gram connections), some patterns begin to emerge. This really shows the power of visualization: without any modern AI, using only count-based methodologies, a pattern becomes evident. In particular, in this graph, higher-ranking officials (Clinton, Mills, Sullivan) have tighter clusters of N-grams, whereas the staff members tend to have hub-and-spoke patterns with weaker average correlations. Also, the terms for Mills and Sullivan are more saturated with names than the others in the graph, which probably relates to the high volume of headlines and press articles with their names attached, whereas others are communicating with colleagues or setting meetings. The actionable implications could be an interesting area of study, especially if a repeatable scoring mechanism could be achieved.
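As a starting point for such a repeatable scoring mechanism, each person’s cluster could be reduced to summary statistics. The sketch below is an illustration only (assuming the `mwc` data frame and `last.names` vector built above): it scores the two patterns observed in the graph, cluster tightness and name saturation.

```r
library(dplyr)

# Hypothetical per-person score: mean correlation of the person's top
# terms ("tightness") and the share of those terms that are surnames
# ("name saturation").
person.scores <- mwc %>%
  group_by(item1) %>%
  summarize(tightness = mean(correlation),
            name.saturation = mean(item2 %in% last.names)) %>%
  arrange(desc(tightness))
```

Whether these two numbers separate principals from staff reliably across corpora would need to be validated; this is only a shape for the idea.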
# Add additional edges to the graph
mwc2 <- mwc.corr %>%
filter(correlation > 0.6, item1 %in% mwc$item2, item2 %in% mwc$item2, !item1 %in% tolower(directory$Moniker), !item2 %in% tolower(directory$Moniker)) %>%
group_by(item1) %>%
top_n(3, correlation) %>%
ungroup()
# Add one additional, 2nd-degree term to the graph
mwc2.terms <- unique(c(as.character(mwc2$item1), as.character(mwc2$item2)))
mwc3 <- mwc.corr %>%
filter(item1 %in% mwc2.terms, !item2 %in% mwc2.terms, !item1 %in% mwc$item1, !item2 %in% mwc$item1, nchar(as.character(item1))>3, nchar(as.character(item2))>3) %>%
group_by(item1) %>%
top_n(3, correlation) %>%
ungroup()
graph.term.by.person(rbind(mwc, mwc2, mwc3), mlist, g.layout = "fr")
The data can also be displayed as a simple report, as seen below:
mwc.rpt.data <- by(mwc, mwc$item1, FUN = (
function(m) {
pt <- as.character(m$item2)
t <- mwc2[seq(1, nrow(mwc2),2),1:2] %>%
filter(item1 %in% pt | item2 %in% pt)
return(
data.frame(
Person = gsub(pattern=" ", replacement=" ", mlist[as.character(m$item1[1]),"displayName"]),
Terms = paste(paste(t$item1, t$item2, sep="-"), collapse=", ")
))
}
)) %>%
rbind.fill() %>%
filter(nchar(as.character(Terms))>0)
| Person | Terms |
|---|---|
| Cheryl Mills | kotecki-mccary, shamim-kazemi, shamim-bastien, bastien-kazemi, mccary-pierre, kara-mcdonald, mccary-kali, kotecki-pierre, kotecki-kali, shamim-kotecki, toiv-nora, bastien-kotecki, kali-kujawinski, louis-mccary, louis-kotecki, kujawinski-kotecki, kenneth-merten, mcdonald-kujawinski, mcdonald-kotecki, lindwall-kujawinski, kenneth-lindwall, kenneth-kujawinski, adams-mcdonald, adams-bastien |
| Hillary Clinton | rand-relieved, rand-sen, benghazi-attack, sen-relieved, rand-attack, terrorist-attack, hear-rand, tragedy-rand, tragedy-attack, misleading-sen, testimony-hear, tragedy-relieved, answer-hear, johnson-rand, happened-hear, happened-rand, dead-attack, answer-happened, dead-rand, johnson-sen, clinton-happened, clinton-rand, difference-happened, emotional-sen, difference-misleading |
| Huma Abedin | scanlon-turkeyarmenia, turkeyarmenia-davutoglu, davutoglu-scanlon |
| Jacob Sullivan | alon-sachar, prem-kumar, sachar-neaipa, alon-neaipa, schlicher-ronald, rubinstein-neaipa, sachar-rubinstein, rubinstein-alon, mara-rubinstein, hale-neaipa, hale-alon, sutphin-sachar, wailes-neaipa, prem-shapiro, mara-alon, shapiro-kumar, jacob-wailes, prem-neaipa, kumar-neaipa, daniel-neaipa, feltman-wailes, dennis-shapiro, jacob-rubinstein |
| Monica Hanley | coleman-claire, pouch-pdb |
Email can assist in telling a story. Treated as individual emails, there may be much interesting data, but aggregating a sufficient volume of email into a continuous document can often be revealing. In the following series of visualizations, key events are used as filters to determine event-specific relationships between terms. Combined with the term types, this produces some powerful, multi-dimensional visualizations.
Excluding any OCR errors, the sent date can be inferred with a regular-expression search of the content and used as auxiliary metadata.
getFirstSentTime = function(content) {
sent = content[grep(content, pattern="Sent:.*2\\d{3}\\s\\d{1,2}:\\d{2}\\s(AM|PM)")][1]
if (is.na(sent)) return(NA)
dtstr = gsub(sent, pattern="Sent:\\s*", replacement="", perl=T, fixed=F)
dt = as.numeric(strptime(dtstr, "%A, %B %e, %Y %I:%M %p", tz="EST"))
return(dt)
}
metadata <- data.frame("id"=character(), "sent.time"=numeric())
for (i in 1:length(corpus)) {
metadata <- rbind(metadata,
data.frame("id"=corpus[[i]]$meta$id, "sent.time" = getFirstSentTime(corpus[[i]]$content)))
}
cat(sprintf("Missing Time Data = %.0f%%", (length(which(is.na(metadata$sent.time)))*100)/nrow(metadata)))
## Missing Time Data = 8%
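A quick way to sanity-check `getFirstSentTime` is to run it against a synthetic header (the lines below are hypothetical, not drawn from the corpus):

```r
# Hypothetical email fragment with a well-formed "Sent:" header.
sample.content <- c("From: A. Person",
                    "Sent: Monday, June 8, 2009 10:15 AM",
                    "Subject: test message")
ts <- getFirstSentTime(sample.content)
# Round-trip back to a readable timestamp to confirm the parse.
format(as.POSIXct(ts, origin = "1970-01-01", tz = "EST"), "%Y-%m-%d %H:%M")
```

Note that `%A`, `%B`, and `%p` parsing is locale-dependent, so this check may fail under non-English locales even when the regular expression matches.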
A list of events, culled from Secretary Clinton’s time in office can be used to correlate activity and even boost relevance signals relative to each event. This list is adapted from here.
keyEvents <- read.csv("ClintonEvents.csv", header = T, row.names=NULL)
keyEvents$Date <- as.Date(keyEvents$Date, "%Y-%m-%d", origin="1970-01-01", tz="EST")
keyEvents$WindowStart <- keyEvents$Date - 7
keyEvents$WindowEnd <- keyEvents$Date + 23
The following graph gives an overview of the selection of data that will be analyzed in later graphs. As a timeline, it shows the volume of email by month, along with the volume and range of data that will be evaluated for each event. The question is: can isolating this data reveal unique signals within the text for that particular event? An arbitrary window of 7 days before and 23 days after each key event (a 30-day window) was selected for analysis, on the theory that planning leads up to an event, and follow-up and review of press articles comes after. This is necessarily a use-case-specific assumption.
First, the event number is stamped on every document with an extracted sent date, and the number of documents for that event is counted.
# This code works only as long as no two events are less than 30 days apart.
md <- metadata %>% filter(!is.na(sent.time))
md <- data.frame(sent.time=anydate(md$sent.time), id=md$id)
md$event <- sapply(md$sent.time, (
function(x) {
y = which(keyEvents$Date >= (anydate(x) - 7) &
keyEvents$Date <= anydate(x) + 23)
return(y[1])
}
))
md.eventData <- md[!is.na(md$event),]
keyEvents$nDocs <- table(md.eventData$event)
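The comment above notes that the code assumes no two events are less than 30 days apart; a defensive check makes that assumption explicit (a sketch, assuming `keyEvents` is sorted by `Date`):

```r
# With a 7-day lead and a 23-day tail, windows overlap whenever consecutive
# events are less than 30 days apart; the sapply() above would then silently
# assign such documents to the earlier event only.
overlapping <- which(head(keyEvents$WindowEnd, -1) >=
                     tail(keyEvents$WindowStart, -1))
if (length(overlapping) > 0)
  warning("Overlapping event windows: ",
          paste(keyEvents$EventName[overlapping], collapse = ", "))
```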
From this metadata, a timeline of email volume with overlaid events is plotted. Also overlaid is an indicator of a sharp drop in available data following the attack on the US Mission in Benghazi, Libya; this is most likely due to higher-than-normal redaction or classification of content from that period.
md.freq <- as.data.frame(table(anydate(md$sent.time), dnn=c("sent.date"))) %>%
group_by(month=floor_date(anydate(sent.date), "1 month"))
md.freq <- aggregate(md.freq$Freq, by=list(month=md.freq$month), FUN=sum)
colnames(md.freq) <- c("month", "freq")
benghazi <- md.freq[md.freq$month >= anydate("2012-09-01") & md.freq$month <= anydate("2013-01-01"),]
ggplot(md.freq, aes(x=month, y=freq)) +
geom_line(stat = "identity", color="DarkBlue", alpha=.50) +
geom_area(stat = "identity", fill="DarkBlue", alpha = 0.10) +
geom_rect(aes(xmin=WindowStart, xmax=WindowEnd, ymin=0, ymax=nDocs),
data=keyEvents, inherit.aes = FALSE,
fill="DarkGreen", color="DarkGreen", alpha=.25) +
geom_text(data=keyEvents, mapping=aes(x=Date, y=nDocs,
label=str_wrap(EventName, 12)),
size=3.5, vjust=0, hjust="center", lineheight=.8, nudge_y = 5, nudge_x = 7,
color="DarkGreen", fontface="bold") +
scale_x_date(breaks = keyEvents$Date,
labels = date_format('%b %d, %Y')) +
# Draw a circle around data related to Benghazi Attack
geom_encircle(data=benghazi, show.legend=FALSE,
fill="Red", alpha=.20, expand=0) +
ylab("Email Count") +
scale_y_continuous(limits=c(0, max(md.freq$freq))) +
labs(title = "Volume of Email by Month and Key Foreign Policy Events") +
theme_fivethirtyeight() +
theme(plot.title = element_text(hjust=.5),
axis.text.x = element_text(hjust=1, angle=45, face="bold", size=12, color="DarkGreen"),
panel.background = element_rect(fill = "white"),
plot.background = element_rect(fill = "white")
)
One thing the graph illustrates is that this event-wise analysis covers only a fraction of the overall metadata: the event-related timeframes include only 28% of the 1,416 days in the dataset, and the selected content likewise represents 28% of the overall data (“law of averages”).
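The coverage figures quoted above can be derived directly from the metadata (a sketch, assuming the `md`, `md.eventData`, and `keyEvents` objects from earlier, and non-overlapping event windows):

```r
# Fraction of calendar days falling inside any 30-day event window,
# and fraction of dated documents stamped with an event.
window.days <- nrow(keyEvents) * 30
total.days  <- as.numeric(max(md$sent.time) - min(md$sent.time)) + 1
day.share   <- window.days / total.days
doc.share   <- nrow(md.eventData) / nrow(md)
cat(sprintf("Days covered: %.0f%%; documents covered: %.0f%%\n",
            100 * day.share, 100 * doc.share))
```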
For each event, what are the top words being discussed? First, a list of all corresponding words by event is built, along with the combined relevancy of each word pair to the overall dataset, using the TfIdf-weighted DTM results. All this data is collated to produce some illuminating visualizations.
cbe.stopwords = c(
mwc.stopwords,
unique(c(tolower(directory$Moniker), stemDocument(tolower(directory$Moniker), language="en")))
)
cbe <- data.frame(
item1 = character(),
item2 = character(),
correlation = numeric(),
event = numeric())
for (e in 1:nrow(keyEvents)) {
group.ids <- md.eventData[md.eventData$event == e,"id"]
group.words <- getWordsFromText(hrcemails, cbe.stopwords, group.ids) %>%
filter(nchar(word) >= 3, !grepl(word, pattern="stategov[a-z]?$", perl=TRUE, ignore.case=TRUE))
group.corr <- getWordCorrelations(group.words, limit=50)
group.corr$event <- e
cbe <- cbe %>% rbind(group.corr)
}
# pairwise_cor() returns each pair twice (A-B and B-A); keep every other row
cbe <- cbe[seq(2, nrow(cbe), 2),]
cbe$eventDate <- keyEvents$Date[cbe$event]
cbe$wordpair <- paste(cbe$item1, cbe$item2, sep = "::")
# since we stemmed our matrix, we want to get the weight of the stemmed items, for better overall normalization
cbe$stem1 <- stemDocument(cbe$item1, language="en")
cbe$stem2 <- stemDocument(cbe$item2, language="en")
#safety check to make sure we're working only with stems that are actually in the DTM
dtm.full <- cs2$getDTM(sparse=F)
cbe <- cbe %>% filter(stem1 %in% dtm.full$dimnames$Terms, stem2 %in% dtm.full$dimnames$Terms)
getRelevancyByEvent = function(e.data) {
e <- e.data$event[1]
event.terms <- unique(c(e.data$stem1, e.data$stem2))
cat(sprintf("Compiling relevancy for event %d (%s subject terms): %s \n", e, comma_format()(length(event.terms)), keyEvents$EventName[e]))
event.ids <- unique(as.character(md.eventData[md.eventData$event == e, "id"]))
term.sums <- col_sums(dtm.full[event.ids, event.terms])
e.data$rel1 <- term.sums[e.data$stem1]
e.data$rel2 <- term.sums[e.data$stem2]
cat(sprintf(" %s relevancy scores retrieved\n", comma_format()(nrow(e.data))))
return(e.data)
}
cbe.gd <- cbe %>%
by(cbe$event, getRelevancyByEvent) %>%
rbind.fill() %>%
group_by(event) %>%
top_n(30, (rel1 + rel2)) %>%
ungroup()
## Compiling relevancy for event 1 (61 subject terms): Obama Gives Speech in Cairo (Egypt)
## 1,098 relevancy scores retrieved
## Compiling relevancy for event 2 (131 subject terms): Obama Awarded Nobel Peace Prize
## 4,689 relevancy scores retrieved
## Compiling relevancy for event 3 (281 subject terms): Haiti Aid
## 22,658 relevancy scores retrieved
## Compiling relevancy for event 4 (110 subject terms): START Treaty Signed to Reduce Nuclear Arms
## 3,258 relevancy scores retrieved
## Compiling relevancy for event 5 (101 subject terms): G8 Summit
## 4,025 relevancy scores retrieved
## Compiling relevancy for event 6 (245 subject terms): Obama Announces End to Iraq Combat Mission
## 25,588 relevancy scores retrieved
## Compiling relevancy for event 7 (175 subject terms): Midterm elections
## 12,230 relevancy scores retrieved
## Compiling relevancy for event 8 (229 subject terms): New START Treaty Signed with Russia
## 21,412 relevancy scores retrieved
## Compiling relevancy for event 9 (185 subject terms): Osama Bin Laden killed
## 12,817 relevancy scores retrieved
## Compiling relevancy for event 10 (280 subject terms): Obama announces end of Iraq War
## 27,453 relevancy scores retrieved
## Compiling relevancy for event 11 (298 subject terms): Obama Freezes Iranian Assets
## 32,640 relevancy scores retrieved
## Compiling relevancy for event 12 (174 subject terms): Benghazi Diplomatic Mission Attack
## 11,320 relevancy scores retrieved
## Compiling relevancy for event 13 (78 subject terms): Obama Reelected
## 2,064 relevancy scores retrieved
To make a graph, the data is prepped with a type column and organized into edges and nodes. Pairs of words with high correlation and high relevancy within a specific event are organized such that the event association is the X axis and relevancy the Y axis. Since words can appear more than once in the data set, terms that appear in more than one event are merged and averaged.
In the following graph, terms are color-coded as before, with an additional indication of event association: uppercase terms are associated with only one event, while lowercase terms appear in more than one. Since multi-event terms are shown at their average event location, they tend to cluster toward the center of the visualization. As a proportion, single-event terms tend to be more interesting than multi-event terms, though the latter may help qualify their paired term, as represented by the connecting arcs.
# rack and stack 'em
getWordType = function(word) {
if (word[1] %in% last.names) {
return("Name")
} else if (word[1] %in% countries$first | word[1] %in% countries$second) {
return("Country")
} else {
return("Other")
}
}
cbe.nodes <-
data.frame(
word = c(as.character(cbe.gd$item1), as.character(cbe.gd$item2)),
event = c(cbe.gd$event, cbe.gd$event),
correlation = c(cbe.gd$correlation, cbe.gd$correlation),
relevancy = c((cbe.gd$rel1 + cbe.gd$rel2)/2, (cbe.gd$rel1 + cbe.gd$rel2)/2),
stringsAsFactors = FALSE
) %>%
by(.$word,
FUN = (
function(w.data) {
w.data$wordType <- getWordType(w.data$word)
avg.relevancy <- mean(w.data$relevancy)
w.data$relevancy <- avg.relevancy
avg.correlation <- mean(w.data$correlation)
w.data$correlation <- avg.correlation
w.data$minEvent <- min(w.data$event)
w.data$maxEvent <- max(w.data$event)
avg.event <- mean(w.data$event)
w.data$event <- avg.event
w.data$point <- ifelse(w.data$minEvent == avg.event & w.data$maxEvent == avg.event, 2, 1)
w.data$x <- avg.event
w.data$y <- avg.relevancy
return(unique(w.data))
}
)) %>%
rbind.fill()
gg = graph_from_data_frame(cbe.gd, vertices=cbe.nodes, directed=TRUE) %>%
ggraph(layout="manual", node.positions=cbe.nodes) +
geom_edge_arc(aes(edge_alpha = correlation), edge_colour="DarkGray", show.legend=c(edge_alpha=FALSE), curvature = .05) +
geom_node_point(aes(color=wordType, size=point), show.legend = c(color=TRUE, size=FALSE)) +
geom_text_repel(aes(x=x, y=y, label = ifelse(point==2, toupper(name), tolower(name)), color=wordType), show.legend = FALSE, segment.size=0.25, point.padding = unit(0.2, "lines"), force=1.1) +
scale_color_manual(name="Type", values=c(Other="DarkSlateGray", Name="Blue", Country="Red", Keyword="DarkOrange")) +
geom_point(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=2.81), color="DarkOrange4") +
geom_text(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=2.9, label=str_wrap(EventName, width = 12)), size=3, fontface="bold", color="DarkOrange4", vjust=0) +
ggtitle("Relevant Keywords Per Major Foreign Policy Event") +
scale_y_continuous(limits=c(2.8,max(cbe.nodes$y)), trans = "log10") +
scale_size_continuous(range=c(1.5, 2)) +
scale_edge_alpha(guide="none") +
theme_safari()
gg
The previous graph presents many variables at once and helps to show the relative data, but its complexity may be distracting. A more structured version, without the term-association arcs, may be preferable to some viewers:
cbe.nodes2 <-
data.frame(
word = c(as.character(cbe.gd$item1), as.character(cbe.gd$item2)),
event = c(cbe.gd$event, cbe.gd$event),
relevancy = c((cbe.gd$rel1+cbe.gd$rel2)/2, (cbe.gd$rel1+cbe.gd$rel2)/2),
stringsAsFactors = FALSE
) %>%
by(.$event, FUN=(function(w.data) {
e <- w.data$event[1]
cbe.related <- cbe.gd[cbe.gd$event == e,]
r.data <-
data.frame(
word = character(),
related = character(),
relevancy = numeric(),
related.relevancy = numeric(),
related.rank = integer(),
event = numeric(),
rank = numeric()
)
seen.words = vector()
sorted.words <- w.data[order(-w.data$relevancy),]$word
for (w in sorted.words) {
if (!(w %in% seen.words)) {
related.words <- c(cbe.related[cbe.related$item1 == w,]$item2, cbe.related[cbe.related$item2 == w,]$item1)
related.rel <- c(cbe.related[cbe.related$item1 == w,]$rel2, cbe.related[cbe.related$item2 == w,]$rel1)
seen.words <- unique(c(w, seen.words))
#compose the data
r.row <-
data.frame(
word = w,
wordType = getWordType(w),
related = I(list(related.words)),
relevancy = max(w.data[w.data$word == w,"relevancy"]),
related.relevancy = I(list(related.rel)),
related.rank = I(list(rank(related.rel, ties.method="first"))),
event = e,
rank = NA
)
r.data <- rbind(r.data, r.row)
}
}
r.data$rank = rank(r.data$relevancy, ties.method="first")
return(r.data)
})) %>%
rbind.fill() %>%
unnest()
cbe.nodes2$related.type <- sapply(cbe.nodes2$related, getWordType)
cbe.gd3 <- data.frame(
item1 = paste(cbe.gd$item1, cbe.gd$event, sep="."),
item2 = paste(cbe.gd$item2, cbe.gd$event, sep="."),
correlation = cbe.gd$correlation
)
cbe.nodes3 <- unique(data.frame(
name = paste(cbe.nodes2$word, cbe.nodes2$event, sep="."),
x = cbe.nodes2$event,
y = cbe.nodes2$rank,
relevancy = cbe.nodes2$relevancy
))
graph_from_data_frame(cbe.gd3, vertices=cbe.nodes3) %>%
ggraph(layout="manual", node.positions=cbe.nodes3) +
geom_node_text(data=cbe.nodes2, aes(label = word, x = event, y = rank, size=rank, color=wordType), show.legend=c(size=FALSE, color=FALSE)) +
geom_point(data=cbe.nodes2, aes(x = event, y = rank, color=wordType), alpha=0, show.legend=TRUE) +
geom_point(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=0), color="DarkOrange4") +
geom_text(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=-1.6, label=str_wrap(EventName, width = 12)), size=4, color="DarkOrange4", vjust=1, lineheight = .9) +
geom_text(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=-.5, label=format(Date, "%m/%d/%Y")), size=4, fontface="bold", color="DarkOrange4", vjust=1) +
ggtitle("High Relevancy Terms by Major Foreign Policy Events") +
scale_y_continuous(limits=c(-4.6,max(cbe.nodes3$y))) +
scale_size_continuous(range=c(3.5,4.75), guide="none") +
scale_color_manual(name="Type", values=c(Other="DarkSlateGray", Name="Blue", Country="Red", Keyword="DarkOrange"), guide="legend") +
guides(colour = guide_legend("Type", override.aes = list(size = 4, alpha = 1), reverse=TRUE)) +
theme_safari() + theme(plot.title = element_text(hjust=0.5))
A more complex form of visualization that tracks specific topics over time is a “theme river.” For an example with a nice interactivity model, see the paper on “RoseRiver,” a more complicated pursuit, but one whose methodology is similar to what is shown here in its selection of topics over time.5
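A full theme-river implementation is beyond the scope of this article, but the underlying idea can be sketched as a stacked area chart of per-topic term frequencies over time. The topic.freq data below is hypothetical, standing in for counts that would, in practice, be aggregated from the corpus by message date:

```r
library(ggplot2)

# Hypothetical monthly term counts for three topics; in a real pipeline these
# would be aggregated from the term-document matrix by message date
set.seed(42)
topic.freq <- expand.grid(
  month = seq(as.Date("2011-01-01"), by = "month", length.out = 12),
  topic = c("Libya", "Haiti", "Pakistan")
)
topic.freq$count <- rpois(nrow(topic.freq), lambda = 20)

# Stacked areas approximate the "river": each band's width tracks a topic's
# prevalence as time moves along the x-axis
river <- ggplot(topic.freq, aes(x = month, y = count, fill = topic)) +
  geom_area(position = "stack", alpha = 0.8) +
  labs(title = "Topic Prevalence Over Time", x = NULL, y = "Term Frequency")
```

Smoothing the band boundaries and ordering topics to minimize visual crossings are what separate this sketch from a true theme river.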
With a little work, basic text mining can go beyond simple statistical queries to provide more useful visualizations and surface more interesting insights. Starting with raw, unstructured text that was the imperfect result of OCR processing of images, the content was processed, cleaned, error-corrected, and normalized, and additional use-case-specific steps were applied. The resulting data was then plotted in a variety of novel visualizations to illustrate associations in ways that simple search results, search filters, and basic bar or pie charts could not.
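The cleaning and normalization steps summarized above can be sketched with the tm package. The regular-expression substitutions here are illustrative examples of OCR error correction, not the actual rules applied to the corpus:

```r
library(tm)

# A toy document standing in for OCR'd email text
docs <- c("Tlie Secretary vvill meet w/ the  PM  tomorrow.")
corpus <- VCorpus(VectorSource(docs))

# Illustrative OCR corrections: "Tlie" -> "The", "vv" -> "w"
fix.ocr <- content_transformer(function(x) {
  x <- gsub("\\bTlie\\b", "The", x)
  gsub("vv", "w", x)
})

corpus <- tm_map(corpus, fix.ocr)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

as.character(corpus[[1]])
```

In a production pipeline the correction table would be much larger and tuned to the error profile of the specific OCR engine.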
The techniques shown demonstrate what can be accomplished with existing, basic text-mining tools. Additional insights may arise from deeper inspection and from applying the latest machine-learning capabilities.
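As one simple illustration of that direction, topic modeling can surface groups of related terms without hand-built keyword lists. This sketch assumes the topicmodels package (any LDA implementation would do), and the four toy documents stand in for the real corpus:

```r
library(tm)
library(topicmodels)

# Toy documents standing in for the email corpus
docs <- c("libya benghazi embassy security attack",
          "haiti earthquake relief aid donations",
          "libya embassy attack security consulate",
          "haiti aid relief donations earthquake")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Fit a two-topic LDA model and list the top three terms per topic
lda <- LDA(dtm, k = 2, control = list(seed = 1234))
terms(lda, 3)
```

On a real corpus, choosing k and interpreting the resulting topics require the same kind of iterative tuning described for the name and keyword lists above.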
© 2018. Mike Safar. All rights reserved.
Hillary Clinton’s email display name is simply “H”, a parsing puzzle so delicate that it hardly seems accidental.↩
Multiple labels could be provided on each one, but for this discussion, less is more.↩
This source data has been adapted from the Ngrams.info website and cannot be redistributed, but can easily be downloaded from their site.↩
Each of the input files required some tweaking. For example, the list of first names initially didn’t include “Barack,” and some names are also common words, like “will.” This kind of tuning is always ongoing: common names change by generation, and immigration patterns change which names appear in a population. Such grooming is necessary for production systems, but the benefits of selective tuning, for example in an application such as handling a large-scale ediscovery case, are considerable, especially if the tuning can be handled on a per-case basis.↩
Cui, Weiwei, Shixia Liu, Zhuofeng Wu, and Hao Wei. “How Hierarchical Topics Evolve in Large Text Corpora,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2281–2290, November 2014.↩